ABSTRACT

Assessments should improve performance by providing 
usable feedback, and should not merely audit it. Problems with 
educational accountability policies stem from a flawed view of 
student assessment. Intellectual excellence cannot be obtained via 
one-time mandated tests composed of proxies for real challenges. 
Common standards should be developed for use in evaluating local 
standards and measures, not common tests. A more performance-based 
accreditation process is proposed, with policies that induce schools 
and colleges to explicitly benchmark local work, chart progress over 
time, and give incentives for meeting high performance standards. 
Authentic educational tests simulate problems of knowledge use found 
in professions and after formal education. Assessments must teach 
students that tasks, criteria, and standards found in schools and 
colleges are appropriate for all rational inquiry and fruitful 
intellectual life. Assessments with flexible and context-sensitive 
opportunities reveal student expertise. Two vignettes for focusing
policy reform and 10 guidelines for developing a consistent system of 
assessment are given. Outcome-based education and site-based decision 
making ensure that all local testing from Kindergarten through 
graduate school (K-GS) involves the worthiest tasks and best 
exit-level challenges, and adapts to all grades. A seamless 
K-12-graduate school system includes: authentic tasks and standards
linking different system stages that are known to all students and 
teachers at lower levels and recur throughout their work; and 
authentic standards and measures that are thoroughly explained, 
taught, and practiced with constant opportunity for revision and 
improvement so that schools and students are genuinely culpable for 
substandard performance. The appendixes provide guidance about 
assessment practice from various sources in terms of general 
principles and recommendations, specific suggestions, scoring scales 
for writing activities, literacy profiles (reading), and "work 
requirements" in literature study and chemistry. (35 references) 
TOWARD ONE SYSTEM OF EDUCATION:
ASSESSING TO IMPROVE, NOT MERELY AUDIT 1 



Is it really proper to say that we have an education "system"? I believe we do not have one 
— and not because we lack a national curriculum. Rather, the long-standing incoherence in 
education stems from four failures: tests that are not built out of exemplary tasks; our 
penchant for impatiently mandating uniformity (instead of requiring quality performance from 
appropriately varied syllabi and tests); the use of one-shot tests in comparing different student 
cohorts instead of assessing the same cohort's progress over time, in reference to authentic 
standards; and our myopia about why traditional tests cannot, in principle, assess the ultimate 
outcomes of a good education. 

The heart of the argument is the view that any assessment should improve performance by
providing usable feedback, not merely audit it. Trying to obtain intellectual excellence
through one-time mandated tests composed of proxies for real challenges is a contradiction. 
It would be more effective policy to develop common standards for use in the evaluation of 
local standards and measures, not common tests. The state can mandate high standards (and 
provide better incentives for meeting them) without imposing standardized tests, as many 
countries long ago realized. What is needed is better, more performance-focused accreditation, not more
superficial testing providing only the illusion of accountability.

Schools and universities then would be appropriately free to develop their own assessments, 
while oversight agencies would reserve the right to audit tests and performance results for 
technical soundness, fairness and effectiveness. When comparisons of schools are necessary, 
we can develop agreed-upon indicators and calibration procedures to provide data without 
imposing high-stakes, superficial tests that corrupt the school's aims and autonomy. 2 

We would thus be devising policies that induce schools and colleges to do what all 
professions and the best companies do in the quest for improvement: explicitly "benchmark" 
local work, chart progress over time and provide incentives for meeting high (and higher) 
standards of performance. In the case of education, this means ensuring that tests are 
"authentic" so that they mirror or simulate the problems of knowledge-use found in the 
professions and at the end of formal education. Those tasks involve research leading to a 
dissertation and its defense, the "test" of apprenticeship or effective grappling with realistic 
case studies (as found in most law, business and medical schools). We would have a system 
when we ensure that younger students are confronted with such maximally enabling tasks — 
not to be confused with tests designed to certify control over common knowledge and 
orthodox ideas, tests that present more of a hurdle and an easy sorting device rather than 
being standard-revealing.

Since life's tasks, our students and our school customers properly differ, we need assessments 
that provide flexible and context-sensitive opportunities for revealing student expertise. But 
we must also demand that local work be done to high standards. Such a view strikes many 
policy makers as paradoxical: how can there be standards without mandated standardization
of tests? 



Two Vignettes for Focusing Policy Reform 

Consider for a moment the characteristics of an interesting assessment strategy found now 
within our educational world. There, the challenges are not at all standardized; indeed, they 
are by design fully personalized, allowing the student free rein as to topic and direction. Nor 
are students subject to uniform tasks required of all; no one thinks this odd or "invalid." 
Further, contrary to common practice, the test is never secret; the assessment, in fact, is
centered on the students' creativity and thoughtfulness, knowledge crafted into products and 
performances of their own design. No student must show a uniform mastery of orthodox 
knowledge to "earn" this right to create — even though his or her school is mainstream and 
not at all "alternative." 

In these institutions, the schedule suits the learner's pace and talents, so that the student is 
only assessed when ready. Here, the relationship between teacher and student is not at all 
adversarial. The teacher is the student's ally and guide through the challenges of the 
assessment, not the enemy to be "psyched out." Here, the assessors do "sit with" the student,
literally and figuratively, probing the student's ideas. The assessor is, in fact, obligated to 
understand the student's point of view to validly test the student's grasp of ideas — a far cry 
from the typical "gotcha!" test.

Perhaps by now you have seen through this ironic picture. In a conference designed to 
broaden our typically narrow perspective on assessment, it is worth noting that kindergartners
and graduate students in our best schools and universities have much in common. All the 
previous conditions of assessment apply to the two extreme points of our educational world.
At both extremes we standardize the quality of performance expected, not the tasks. One 
might even say that at both ends we focus far more on the student's intellectual virtues than 
the correctness of their words. Nor would we dream of glibly comparing these students in 
merely a normative way. Each piece of work, be it a drawing or a dissertation, is examined 
for what it reveals about the learner's habits of mind and ability to create meaning, not one's 
"knowledge" of "facts." Even the details of test administration are parallel: resources are not 
only allowed in the "test," but we often want to determine whether the student wisely uses all 
available resources. Test "security" thus is a foolish and counter-productive strategy in each 
case. 

Indeed, it is only at the beginning and end of formal education that we acknowledge the truth 
of how intellectual accomplishment is best judged: through an evocative examination of the 
student's use of knowledge and by "subjective" interaction of mind and mind — dialogue. 
The essential question is not: Is the student correct? But rather: Do the student's ideas,
arguments and products work — i.e., do they effectively and gracefully achieve the student's 
intention? Is the student making progress in meeting apt standards of craftsmanship and rigor?

Somehow we have gone wrong in the vast middle of schooling, violating our sense of what
"education" and "assessment" really mean. What must we do to recapture the radically 
common-sensical view that any test of intellectual ability must be interactive? Why do we 
not see that typical tests bear no relation to the challenges facing the student in later worlds 
and thus lack the kind of validity that should matter most? And what must a policy maker do 
to rectify the situation when such dysfunctional habits run so deep? 

Yes, "habits." Habits, not rational policy choices, brought us to this point. Unthinking
reliance on past practice sustains large-scale testing as a solution to performance problems,
not evidence or a lack of "research" or money for alternatives. Real reform begins when we
see that more imposed testing as a response to our educational problems is like any other
addictive and rationalized behavior. The second vignette to keep in mind in contemplating
new policy, therefore, is the tale of the "Emperor's New Clothes."

In the story, rascals pose as tailors "weaving" a suit of the "finest" cloth for the king, earning
riches by the fashioning of an illusion — an illusion not only about the garment itself but also 
about their skill in serving the king. The king's nakedness remains unseen because of the
tailors' warning: only boorish folk would fail to recognize the quality of the "incredibly fine"
yarn. And so it happens that the townspeople rationalize the nakedness and their secret
doubt; they, like the king's retinue who fear for their honor, praise the king as he parades in 
his "finery." The king, too, knows that he is not a commoner and is sucked into the 
self-deception. It is the innocent child, unpossessed of a need to appear refined, who exposes 
the hoax. "But he has nothing on!" exclaims the child. 

Oddly enough, few people recall the story's ending — and it is the ending that shows how 
harmful unthinking habits can be. The elders initially dismiss the remark of the young
"innocent." Eventually the truth of the child's words cuts through the illusion. But while
thinking that the now-skeptical townspeople must be right, "the Emperor thought to himself,
'I must not stop or it will spoil the procession.' So he marched on even more proudly than 
before, and the courtiers continued to carry a train that was not there at all." 3 

The tale is instructive about current testing policy on many levels. We still do not want to 
spoil the procession. Testing increases while few useful results emerge from the investment.
We are still dismissing the remarks of the "innocents." We do not look through the eyes of 
students as they prepare for and take the tests we buy to see how debilitating they are to 
intellectual engagement, courage and imagination. Nor do we look through the eyes of
employers, teachers and administrators to see how rarely they study test results to understand 
an applicant's abilities or the meaning of their errors. We are so self-deceived that one 
invariably hears in conversations on test reform: "But we made it through the system, didn't 
we?" as if these high-stakes tests were nothing more than the "harmless" indignities of a
freshman initiation of years gone by. 

Multiple-choice test makers literally profit from the illusion that, like the tailors' yarn, all
"fine" tests must be built with a specialist's mysterious skill. Testing, rather than the very
common practice of assessing student performance on the tasks we value, becomes an arcane
science that is entrusted — and apparently is entrustable — only to statisticians. 4 Critics of 
such tests fear looking like the crude folks the tailors warn their critics will be; wary 
practitioners are routinely made to feel ignorant of the true "finery" in test validity and 
reliability. 

The unreal simplicity of the typical test item is much like that of the king's nakedness: so 
obvious as to make one feel that some complex set of standards must surely render the test 
substantive. Like the townspeople in the story, we end up talking as if the real achievements 
we value and ought to be measuring were being directly observed in detail on multiple-choice 
tests. Even the supposed necessity of secure tests becomes "obvious" to everyone — as if all 
the important challenges and criteria in life (getting employed, writing a thesis for a doctorate, 
obtaining a driver's license, winning the Super Bowl or submitting a winning engineering or 
graphics design bid) were routinely kept secret from prospective performers. 

The inevitable then happens. Rather than having policy incentives that would improve 
classroom assessment in a manner appropriate to instructional aims, teachers and professors 
are encouraged to imitate the psychometric "tailors." Even while talking of the foolishness or 
harm of such tests, educators usually end up employing their own inadequate versions of 
them, the true sign of the tests' mythic rather than rational power. The call for more
mandated, valid and securely administered standardized tests then naturally increases as local 
assessment (thus, performance) deteriorates; the vicious circle continues. 5 Bring in the 
tailors! Let the king march more proudly! 

Unlike in the story, then, our assessment "garments" still seem sublime rather than illusory. 
The child's voice — common sense — still remains largely unheard or dismissed in policy 
circles. Thus, cramming takes precedence over pride in one's work and joy in being 
genuinely tested; schools and colleges worry more about test results than whether scores 
represent thoughtful or thoughtless mastery. We have all lost sight of the fact that education 
succeeds when we provide students with the joy of thinking deeply and well about important 
things — the joy that is one of the chief incentives for staying in school and really mastering 
the "basics." * 



Toward a Consistent System of Assessment 

Since the point of this paper is to provoke sharpened discussion and to carefully revisit the
"obvious" policy answers, let me offer some brief propositions intended to advance thinking 
beyond the hackneyed. These are postulates for further discussion; I do not propose to justify 
all these notions here, but they do have value as policy foundations. 

1. Tests ultimately teach more than they measure what was taught. They should therefore be 
composed of the tasks and qualities we most value. What we test is what gets taught. The
tests we design are the de facto aims of education; they must therefore embody our standards. 
Can students do research, fashion compelling arguments, bring closure to discussions, plan
and execute a project, infer conclusions from confusing data and write engaging prose or 
poetry? Let our assessment systems be primarily built out of such tasks, modified for age 
and experience as necessary. 

This is the argument for "authentic assessment." But the postulate further suggests that it is
the obligation of the assessor to evoke from the students the full extent of their knowledge — 
even, perhaps, not resting content with the first student answer given. We are obligated, I 
believe, to base important decisions on an accurate and complete profile of the student's
intellectual accomplishments, as opposed to a single score based on the student's success or
failure in answering a small set of imposed test items. 

Put simply, we should be standardizing the (high) standards for judging (often idiosyncratic) 
intellectual performances, not standardizing items for use on inherently superficial tests. 

The policy implication is clear: any "audit" of performance should be derived from this
primary record of (local) work and achievement, not through the imposition of a common, 
indirect test. If indirect measures must be used, they should be carefully scrutinized for
obtrusive or counter-productive influence on learning and teaching. 

The collection of student work can be selected and edited as necessary to serve policy 
questions and needs. Why have we consistently failed to use the basic statistical tool of 
effective sampling of student performances and productions, just as we do in the artist's 
portfolio or the doctor's residency? Why haven't policy makers asked educators to "agree to 
agree" locally on the kinds of tasks worth mastering and to report on the results over time? 
(California's matrix sampling of student writing, Vermont's sampling of student portfolios 
and Connecticut's sampling of high school seniors in performance-based tasks in math and 
science are notable recent examples that illustrate the point in the K-12 state policy world). 

Operating under such a premise would get us beyond two utterly dysfunctional habits of 
formal education — the view that students must first learn and be tested upon "the basics," 
out of context before earning the right(?) to tackle our most valued tasks; and the view that 
large-scale tests must be composed of proxies for authentic work on economic and 
psychometric grounds. But at what cost in dropouts and to the integrity of our schools and 
universities? At what cost to the capacity of schools and colleges to produce scholars — i.e., 
students who can do their own research well? And where has the debate been over what kind
of margin of error is tolerable in assessment? Other countries have historically been content 
to use human judges in scoring, even in high-stakes examinations. Are we using a 
micrometer when a ruler would do? Is the demand for such reliability causing our intellectual 
values to be co-opted in the name of "precision" and "objectivity"? 

2. The measure of a successful assessment system is the degree to which it improves student 
and school performance while validly and reliably measuring performance. Assessments 
should be assessed, in other words, not merely for their validity but also for their 
effectiveness in improving performance. Credible and worthy measures and standards in the
"real" world do more than monitor performance. They improve performance; standards and
expectations are both raised, whether we are talking about Nintendo, engineering or musical 
productions. 

Why are school standards and expectations falling? In part because of our tests. Large-scale
tests often have to be "lowK*Uing" to fully capture and discriminate the full range of 
normative, not exemplary, performance. If school and college faculties are increasingly asked 
to "teach to the (low-level) test," they become like the myopic patient in fear of pain who at 
the yearly physical thinks only of psyching out the doctor on the few tests given in the office 
visit. Like the patient, our faculties come to confuse cause and effect because of our policies
and pressures; they, too, fixate on short-cuts to get the indicators up — instead of making a 
long-term commitment to the daily regimens that produce (intellectual) healthfulness and
better long-term results. 

A solution is for states to formulate and support assessment policies that make systemic 
effectiveness part of the essential parameters of assessment design and use. 6 A good 
assessment system assists in improving local performance because it is in complete harmony 
with the intended outcomes. Put a different way, we need assessment policies at the local 
and state levels to ensure that the design, purchase and use of assessment strategies support 
the aims of the organization. 

At the very least, this view suggests a policy like those found in many European countries 
where local educators are free to design their own assessments, but the instruments have to 
meet standards set by regional boards. Similarly in scoring: through the "moderation" 
process (frequently used in Great Britain and Australia and proposed for the national exams 
of the New Standards Project in this country), local scores are re-calibrated as necessary to 
ensure that common standards are upheld. Such a procedure is the only way to obtain both 
high standards and ownership of the assessment policy by local educators. 

"Improvement" can only be measured if we assess the same cohort of students many times, in 
reference to stable standards. At the very least, we should be using pre-test/post-test 
measures for holding schools and colleges accountable. To compare one year's cohort to 
another's, given the rates of mobility and changes in demographics that affect schools and
colleges, is to forfeit insight into the determinants of accountability. 

3. Standards are set by establishing benchmarks and models: quality performances at 
exemplary tasks. We must assess both "input" as well as the quality of student "output" — 
i.e., the qualities of tasks assigned and products received. But as important as it is to choose 
more authentic tasks, we must begin to see that choosing tasks is not sufficient to set 
standards. The two senses of "standards" — worthiness of task undertaken versus adequacy 
of result — are independent, captured in the differentiation made in music, diving and 
gymnastics — the difficulty of a task and the quality of a performance. The current 
discussion of "standards" is thus hopelessly confused because it conflates the two meanings 
and often collapses them into one statistical artifact. (See the recent NAEP math standards
task force report for an example of this mistake at the national policy level; NAEP has
selected test items to represent standards.)

Standards would require us to design tasks that make the student perform or produce a 
product. Quality can only be observed when the student must use knowledge. Have we
completely forgotten Bloom's taxonomy — never mind common sense? Bloom and his 
colleagues argued that the upper-level capacities could by and large only be assessed by the 
student being asked to fashion "unique products, performances or discourse." Simply because 
someone can point to the right answer on an upper-level math item on the NAEP test reveals 
nothing about the thoughtfulness and effectiveness whereby they habitually do their work. 

What we need in both assessment and curriculum design is agreement on the important set of 
rich tasks to be mastered and samples of work that embody the standards set for those tasks. 
This would be the intellectual equivalent of designing the Decathlon, and fixing appropriate 
qualifying scores for each course and level of education grounded in some sound 
"benchmarking" or equating process. (This is the true spirit of portfolio assessment whereby 
we collect samples of work from students on all the important genres and tasks over time to 
assess for habitual achievement; it is also the only way to ensure that testing is valid since we 
cannot assess control over all important tasks in one test sitting). 

In Kentucky's new state assessment system (through the work of the Council on Performance 
Standards), there would be broad faculty consensus on and extensive use of an 
"exemplary-task bank" from which assessments might be constructed. The tasks would be 
performance and production challenges, developmentally modified as necessary, deemed 
worthy by both scholars and professionals. 

Such a view means thinking of "tests" in the same sense in which rock-climbing tests the 
climber — a revealing measure of both training and the essential habits of mind. Such 
perspective is not the luxury of the elite. Empathy with odd, unfamiliar or alien views; 
ability to construct plausible cases in ambiguous situations; self-adjustment; sustained, 
effective and responsive analysis; capacity to argue effectively but tactfully, etc., are the 
hallmarks of any thoughtful person. 7 Until our assessment system both evokes and requires 
these virtues for success on the tests, we will be vitiating liberal education. 

4. Performance standards are not test norms, nor are they arbitrary cut scores. They are 
anchored by "benchmarks" — specific, apt, exemplary performances or products. Quality is 
not an abstraction or a statistical artifact. Standards for rigor, precision, creativity, implied
persistence, etc., are set by examples: samples of actual products/performances that
exemplify the qualities we seek. Performance standards are empirically induced, in other
words, by a combination of research, observation and wise judgment. We "set the standard"
at the top of our scoring scale through the wise choice of "anchors" — samples of work that 
we believe to be genuinely excellent and apt models for emulation. Like the bullseye or 
prize-winning essay, a real standard supplies the essential element of effective self-assessment 
(usable feedback) where I see for myself how my work compares with the
standard-setting work. Thus, a real standard empowers the student: it enables one to
effectively self-adjust performance. 

The key questions, then, for policy become: Who chooses the benchmarks and by what 
process? What is a justifiable standard, one mindful of our highest expectations, matters of
equity and sensitivity to the goals and contexts of schools and colleges? What are appropriate
expectations on the developmental path toward those (stable) standards versus arbitrary grade- 
or "exit-level" expectations that are set in a vacuum? 

What we now refer to as "standards" in testing are mere cut scores — useless for a 
meaningful examination and improvement of performance. If there is no qualitative 
difference between a 59 and a 61, what does it mean to say that 60 is "passing"? This is not 
a problem limited to local testing and grading. The Advanced Placement exams are scored 
with reference to historical norms of the distribution of past results. We really have little idea 
whether this year's percentage of Advanced Placement students who get 5s, 4s and 3s are 
writing essays as good as those of their equals of 20 years ago because the Head Reader 
refers to patterns of old scores (linked to the standard curve) when turning raw scores into 
final scores. This is a critical issue if we seek to increase the number (and, presumably, alter 
the general calibre) of the students taking and passing the exams in a way that doesn't lower 
standards. 

To set stable and evocative standards: 

5. We should routinely assess student work from the viewpoint of stable exit-level criteria 
and standards. This is the only conceivable wise way to chart genuine progress over time. 
The British and Australians have tried such a system (see the appendix for two samples) and 
it warrants our immediate attention. Only when we routinely score current work against 
exit-level standards can we be sure that local grades are sufficiently reliable and correlated 
with the ends of education; only then can we know what constitutes not merely "normal" 
growth but effective progress — i.e., the "slope" of improvement, in reference to authentic 
standards. (Think of how athletes and musicians — and video game players — fix their 
sights on exemplary performances to gauge and direct their own progress). 

This is also the only way to get beyond the sham of age-grade equivalents that are really 
arbitrary timetables for charting norms. At present, our testing rewards native talent and 
speed in learning instead of providing a revealing and flexible record of (necessarily varying) 
student progress in meeting standards. 

It follows that: 

6. An important problem in our educational assessment system is the absence of standards 
for faculty grading of student work. There are now neither shared criteria nor stable 
performance standards for ensuring adequate inter-rater reliability and consistency across 
faculty members, nor are there policies that require faculties to "agree to agree" on grading
standards within a "tolerable" margin of error. The British and others have addressed this 
problem through the "moderation" process, and we need to emulate it (as Vermont plans to do 
in its portfolio process). 

While faculties at all levels, and especially at the college level, may balk at such a 
requirement on grounds of professional autonomy, such an argument is specious: no teachers 
have the right to design invalid and/or unreliable assessments, nor do they have the right to
design tests that do not finely articulate with the stated mission, goals and outcomes of the
institution. 

7. Quality work is incompatible with one-event, predominantly external, testing. If our aim 
is not merely to hope for uniformly excellent work from all students but evoke and receive it 
on a consistent basis, we must provide the student with multiple opportunities to be re-tested 
on the same essential tasks — just as we do in all performance-based teaching and learning. 
Or is our aim to set standards through invidious comparisons only, whereby we fail large 
numbers of students in the name of high standards, i.e. steep "curves"? (Note that this 
dysfunctional view of assessment is not found at the kindergarten or Ph.D. levels.)

Because so much of both local grading and large-scale testing is typically norm-referenced,
we have trouble imagining "high standards" being met by all students — without a corruption 
of standards. That a wide diversity of performance should be expected after an education is a 
view that is not found, thank God, in flight school or in medical school where uniform 
success is the only standard. This is similar in vocational programs. A vocational teacher 
told me that he required all his students to get 100% on the tests for use of a radial saw 
before allowing unsupervised use. Imagine a typical 60% being an acceptable performance there.

If we had a genuinely standard-referenced system, we could and would happily have all 
students earn A's (as long as we audit to be sure that the grades genuinely reflect what the 
judges claim they mean). This is what I think the job of state departments of education and 
accreditation agencies should be: setting standards for local standard-setting and conducting 
audits to ensure that local criteria and standards are honored in local assessment within 
"tolerable" margins of error across faculty members. This is not implausible — the New 
York State Regents Exams have operated this way for 100 years. 

Demanding and getting quality work is a local, daily affair. Excellence is only obtained by 
successive approximations toward (the known) "standard" performances. This means that a 
demand for excellence and one-shot tests, with no opportunity for effective feedback and 
revision, are inconsistent. Quality control is the avoiding of sub-standard performance, which is
impossible if the only assessing that counts occurs once, late in the year and relies on a 
relatively unpredictable sample of some de-contextualized body of knowledge, through the 
device of a "secure" test.

In fact:



8. The demand for glib comparability of all schools and colleges, using a few scores on 
indirect tests, is not accountability at all. Of all the issues discussed at the conference for 
which this paper was written, nothing generated more immediate objections and gasps of 
surprise than the question: "Who really wants such comparability?" 

Comparisons are inevitable in judging the quality of any performance. They also aid the 
student in improving — but only if the comparisons offer useful feedback about how to 
improve. I know of no walk of life except education where the sole measure of one's 
achievement is a one-shot superficial test, yielding one aggregate number that cannot assist 
the learner in improving. Whether we consider apprentice athletes, doctors, pilots or soldiers, 
we find that an array of statistics and usable feedback are presented to profile the capacities 
and accomplishments of the student. 

Consider a simple example to show how blinded we are on this matter. It is mathematically 
indefensible to aggregate a baseball player's many statistics into one aggregate score; "hits" 
and "runs batted in" and the others are incommensurable. So, too, in the intellectual world — 
"intellectual initiative," "persistence," "powers of analysis" and "consistency and precision of 
results" are different, valued capacities. Why, then, is it tolerable in testing and reporting to 
reduce a year's worth of complex work to one score on a proxy test? Why don't we see that 
a score for de-contextualized "knowledge" and "skill" is fundamentally misleading if we don't 
assess the judgment and care in the employment of knowledge and skill or value the task that 
the test item demanded? A report of aggregate student scores, used to compare schools 
glibly, tells us little when curricula are diverse and differently organized from school to 
school. 

Comparisons in education have almost always been invidious and of little value for 
stimulating improvement. The reasons are obvious: we test what is easy to test and not 
what is essential; large-scale tests tend to be immune to local curricula, hence local gains; 
rankings rarely deviate much from year-to-year results. The scores usually say more about 
lucky gene pools than value-added achievement by the school or university. 8 Worse, 
norm-referenced tests are designed to exaggerate the differences between students. In test 
pilots, questions are thrown out if everyone gets them right or wrong; the aim is to maximally 
discriminate. Such a mechanism ensures that an institution or student is extremely unlikely to 
change position in the ranking. What does that do to incentive, never mind fair 
accountability? The built-in stability of conventional test scores and their "curves" thus tends 
to yield self-fulfilling prophecies about test results, hardening our prejudices and fatalism 
about change. 

The solution to this mess was proposed 80 years ago: a test of performance in which the 
performers compete against known standards, not other performers only. Alas, so-called 
"criterion-referenced" tests have been a sham up to this point because of their proxy and 
secure nature, and because the "criteria" were applied to item selection and not student 
performances. Let us enter the modern world of performance assessment by devising tests in 
which the tasks, scoring criteria and standards are benchmarked and enticing. But that means 
challenging the most basic aspect of traditional testing — test "security." 

9. The aim of getting quality work from all students is utterly incompatible with "secure" 
tests and mysterious criteria and standards. Quality work in all domains always depends 
upon accurate self-assessment and self-adjustment in reference to specific, known standards. 
How can we expect students or schools to improve if the assessments rely on a fairly 
unpredictable sample of unknown, generic test questions? Only deep-seated and unthinking 
habits blind us to an obvious fact: pervasive secrecy in assessment is counter-productive, 
and immoral. A few fanciful vignettes may be useful in jolting us to see the harm: 

A. Consider what our response as adults would be to a job evaluation process in which 
the employer could do what test-makers routinely do: pick a few tasks from the 
hundreds we had learned and performed over the years, without our knowledge or 
consent, and assess our performance on a one-hour "secure" paper-and-pencil test. 
(Worse, imagine your employer relying on a testing company to assess your 
performance through the use of generic multiple-choice tests.) It is telling that, for 
adults, the practice would be regarded as unfair, inappropriate and likely illegal. Why 
does it not seem so when dealing with students? 

B. Imagine if student musicians had to wait until test day to know the music they would 
be playing in concert. Assume, too, that students play their instruments through 
microphones connected to other rooms where judges could listen but students could 
not hear themselves play. Weeks later the student would receive a single score telling 
them where they stood relative to all the clarinet or trumpet players in the state, and a 
computer print-out summarizing the stylistic and technical areas they should work on. 

C. What if baseball were played all season long, but the pennant races were decided 
using one-shot tests with one aggregate score designed by statisticians? Thus, on the 
last day of the season, specially constructed, secure tests would be given to each 
player, composed of a sample of drills and game situations. The pennant races would 
be decided by each year's (new) test and its results. 

Performance and performance improvement are impossible when the information received is so 
limited, a limit imposed by prior secrecy and the vagaries of test-maker sampling. But 
more than the first two vignettes, this last one reveals how unwittingly obtrusive the designers 
of one-shot high-stakes secret tests can be — even if their aim is to be merely helpful 
statisticians. The test designer seeks to design a valid assessment of all the important 
sub-skills as specified by others; secrecy is required to enable simple parts of a complex game 
to be efficiently used to draw valid inferences. Yet we easily see how such a system would 
corrupt coaching and the game itself. Not only the student-players, but the teacher-coaches 
would be robbed of the capacity to concentrate on excellent play in such an assessment. 

10. To meet the most demanding exit-level standards requires that we practice meeting them 
throughout our career, whether we are thinking of a year's worth of schoolwork or a 
K-graduate school career. We must recapture what every coach knows: drill at the simple 
and constituent parts of a complex task is necessary but not sufficient. We must continually 
practice the ultimate performance as well as the constituent parts to achieve mastery. All the 
more so if the ultimate performance is the creative, rigorous and personalized work of 
developing one's own ideas into knowledge. 



Assessment and Opportunity 



This last proposition goes to the heart of any common-sense view of a "seamless" education 
system. The point of organizing schooling around exit-level intended outcomes couldn't be 
simpler: design curricula backwards around the achievements you wish all students to master. 
It follows that our assessments must do more than reveal whether the learner has mastered 
what one isolated teacher happened to just finish teaching. The examining of students at each 
stage must embody and point toward the tasks, criteria and standards that represent the future 
goals. Assessment must always be constructed out of "enabling" tasks. 

The ultimate intellectual challenge of formal schooling is the dissertation and its defense (or, 
more generally, the rigorous development and full explication of one's own ideas and 
arguments). We should therefore anchor our K-graduate school system by this 
"standard-setting" task — not because all or even most students will earn Ph.Ds, but because 
ensuring maximal mastery of the ultimate task requires that students continually practice and 
be tested on it (even if in simplified form) from the beginning of learning. No novelty here: 
look at Little League, ballet or chess. 

This is not as far-fetched as it might seem. In the Advanced Placement Art Portfolio Exam, 
students submit evidence of their choosing that reveals the breadth of control over important 
genres and studies and an in-depth project focused on one problem, theme or style. They also 
supply written explanations of their intentions with the pieces chosen for inclusion. In effect, 
they are judged on the effectiveness of the realization of their intentions (as with the 
dissertation), not someone else's view of what subject matter they should "know." The 
instructions to students and judges make this clear: the assessors look for pieces that "work 
in their own way." Similarly, I have seen high school English teachers anchor their portfolio 
assessment in the student's choice of "major pieces" and their self-assessment: a contract 
with the self, as it were, to produce quality. 

We can put this talk of "seamlessness" and building "ownership" of the assessment process in 
a very different way — in the language of equity. If we are ever to qualify students for the 
upper-level task of quality production (and thus maximally qualify them for a fruitful 
adulthood), we must require that all tests throughout the system reveal and point the student 
and teacher in those directions. To cast the point in the common-sense language of quality 
control: "Quality is not what our tests say it is; quality is what the customer wants and 
requires from us." 10 Educators are quick to dismiss this kind of remark with a wave of the 
hand and some derisive comments about businesspeople and glass houses, etc. But the point 
is well taken, particularly if we remember that each division of schooling has an internal 
customer: the next level(s) of schooling. How rare it is for middle school teachers to see 
our best high school papers and tests to know how to equip their students for success! 
Worse, how rare it is for isolated high school teachers who are prone to covering content to 
see how dysfunctional their own view of teaching, learning and assessing often is, if the 
point is to maximally qualify students for complicated, interesting, intellectual work. 

The following example, a freshman European history exam at Harvard, illustrates "what the 
good-college-as-customer wants": 

1. (30 minutes) The student must choose eight of 16 sets of items and explain 
why one of the items in each set does not belong. 

a. Waterloo, Trafalgar, Austerlitz 

b. Night of August 4, General Will, terror 

c. Montesquieu, Madison, Calhoun 

2. (45 minutes) Choose one question. 

a. Imagine yourself Jean-Jacques Rousseau, but living in the early 20th 
century. Write a brief review and evaluation of Freud's work in light of 
your own theories of society. 

b. Imagine yourself Karl Marx, living half a century later. Write a brief 
evaluation of the programs of the Fabian socialists and the American 
reformers such as T. Roosevelt to present to the Socialist International. 

c. "Women's history can only be understood as part of general historical 
development." Do you agree? Why or why not? 

3. (45 minutes) Choose one. 

a. "Both Germany and the U.S. fought decisive wars of unification in the 
1860s, but in Germany, the landlords retained great power after the war, 
while in America, the landlord classes lost decisively." Discuss. 

b. Compare and contrast the causes of the two world wars. 

c. Would the European economies have developed differently without the 
role of the non-European peoples? 

4. (One hour) Choose one. 

a. Is the history of Western society in the last 350 years or so a history of 
progress? (Make sure you define "progress.") 

b. "Until 1945, the history of Europe is a history of warfare: preparing for 
it, conducting it, concluding it. This was the price of a continent of 
nation states, and democracy did not improve the situation." Discuss. 



Clearly, something more than mastery of dates and names is wanted here. Rigorous and 
creative analysis is required — and a good amount of style in answering is no mere frill. 
Observe, too, that students have significant choice, true of most good college exams. But of 
most importance in this assessment are the implicit and unspecified standards and criteria of 
performance expected — a paradox of high-quality, upper-level education. We (by-and-large 
correctly) assume that students in good colleges ought to understand the kind and quality of 
the answers that are required here. In an excellent college, where students have long 
practiced the construction and refinement of historical analyses, such vague instructions as 
"Discuss" or "Compare" are (or ought to be) sufficient. A good pre-collegiate education 
prepares you for such exams. But how prepared are students who graduated from the average 
system — the "content-coverage school district"? 

They clearly are not well-prepared, given our non-system of educational tribalism, where each 
test is built out of parochial myths or eccentric tastes about what really matters. The 
graduates of Recall and Regurgitate High School are in for a rude shock (an immoral shock) 
that results from our "system" being no system at all. Why aren't most high school 
teachers obligated to know the kinds of tasks and grading standards that are required for 
success in good colleges? What policies might induce them to anchor their work in 
exemplary upper-level work — as successful coaches routinely do in all schools? What 
untold harm is done to students who find out too late that they are unprepared for their future 
aspirations because their schools felt no need to go beyond mandated tests to scout out and 
reveal the standards in force at the higher levels of education and employment? Why are 
these matters so often left to test companies who are more interested in the most generic and 
cost-effective (i.e., marketable) test rather than exemplary and enabling assessment? 

I am talking about maximally qualifying students for admission to the most worthy programs, 
not ensuring their admittance to any particular place. 11 Nor should we be dissuaded from 
this task by looking at current admissions tests. The issue is better protection of students and 
schools from the inevitable harm of using exclusively secure and indirect tests to rank and 
sort efficiently. (It is also unconscionable that students and schools pay the entire fee for 
colleges' admissions practices.) 

Suppose we then think of our job in the K-12 arena as maximizing the likelihood that all 
students can handle the tasks found in the best universities and places of employment. What 
will all students need from our assessment system? For one, our examining must involve 
recursive practices more routinely. We will want to know whether students are making 
progress in performing at important tasks. Such a view means breaking another bad habit: 
viewing assessment as the testing of what was taught, after teaching and learning are over. 
Assessment must be both educative and enabling, effectively communicating the kinds of 
tasks and the quality of performance we eventually expect students to handle when they are 
adults. 

It follows that we should routinely assess student performance against exit-level tasks and 
standards as opposed to age-cohort expectations and norms only, as we do now. We should 
do more routinely what Illinois does in its state writing assessment: compare student papers 
from three different grades against the same standards and criteria. We should do what many 
vocational programs and some high schools now do: assess their students against entry-level 
job or college performance standards. We then ask: what kinds of "scaffolding" or "training 
wheels" would have to be provided to younger, less experienced students to give them guided 
practice in handling such tasks? This is one question we should be addressing regularly in 
our assessment and syllabus designs but rarely do. Do we, then, have a system? 

Other countries have done a bit better in this matter than we have — ironically, through 
external examinations. Leaving aside the wisdom of our having national exams (in a country 
with no national curriculum?), we ought at least to appreciate the informative and 
standardizing impact of high-stakes, syllabus-linked exams that directly relate to entrance 
requirements in college, as occurs in Canada and most European countries. In Alberta, for 
example, half a student's grade for the senior year is his or her exam score; the other half is 
the teacher's grade for the work of the year. One noteworthy result: teacher grades are very 
reliable, and there is extremely high correlation between local grades and exam scores. One 
can quickly see why. A teacher would be pilloried who gave consistent 90s to students who 
ended up getting 40s on the exam, given the 50/50 grading system. Known, shared and 
locally used standards are in everyone's interest, in other words — all the more so since the 
exam scores in most cases count toward admission and student "majors" at all Alberta's 
colleges and trade schools. We severely underestimate the power of students to meet high 
standards in this country because we so rarely inform them well ahead of time about those 
(unbending) standards, and we so rarely make them real in their day-to-day work as they do 
in Alberta. 

One ironic virtue of living in a world of external examinations is that the teacher becomes the 
student's ally, not the judge, jury and executioner and "guess-what-I-want" tester (we see this 
same relationship in good Advanced Placement courses in America). 12 Students are then far 
better equipped with knowledge of standards, criteria and tasks. All teachers in Alberta are 
required to teach students about prior exam questions, show anchor papers and share the 
scoring criteria for the essays. Consider this example from a Canadian provincial exam in 
history for use in scoring the required essays (in addition, the student receives the four 
scoring rubrics for each set of traits scored): 

Marks are awarded depending upon how well a student meets the following 
requirements: 

1. Defense of Position 

a. Evidence of a Position 

• Is the writer's position evident? 

b. Logic and Persuasiveness 

• How well-chosen are the examples? 

• How well does the writer make the examples serve the position? 

• Are the arguments based on scholarship and reason rather than unsupported 
assertions? 

• Are the arguments based on valid assumptions? 

2. Discussion of Value Positions 

a. Identification of Value Positions 

• Are two or more value positions indicated? 

b. Thoughtfulness 

• How adequately developed is the discussion of alternative values? 

• What depth of understanding of the issue is demonstrated? 

3. Presentation of Examples or Case Studies 

a. Relevance 

b. Accuracy 

c. Comprehensiveness 

4. Quality of Language and Expression 

a. Organization 

b. Convention 

c. Syntax and Vocabulary 

Please ensure that all students have prior access to this scoring guide. 



Conclusion: Honoring the Purposes of Liberal Education 

The problems with all educational policy concerning accountability grow out of a flawed view 
of student assessment. Shouldn't assessment, even large-scale assessment, be designed to 
assist the student by embodying standards and offering usable feedback, not merely 
"measuring" performance through proxies? Because we assume that mandated tests should primarily 
serve overseers, not teachers and learners, our students are routinely "tested" but never 
challenged, understood or inspired. 

To "assess" is to "sit with" the learner, as the word's roots reveal. Our task is to find out 
whether students can demonstrate the ability to employ knowledge effectively and elegantly. 
A multiple-choice test is a simplistic audit, and, as in business, an audit of the books bears 
little relation to the quality of the company's performance. We have become blind to the 
harm of substituting these machine-scored items for authentic intellectual performance. 

The harm stems from the form of the instruments and their effect on teaching. The format of 
current tests is a structure at odds with modern views of knowledge. 13 Genuinely "good" 
answers are not "correct" (as multiple-choice tests require — and ultimately teach) but 
justified or well supported. Real problems grow out of idiosyncratic courses and contextual 
concerns; they are not generic "items." Answers to real questions are not self-evidently right 
or wrong, but require analysis — maybe even dialogue — for their soundness or plausibility 
to be established. By not evoking or requiring student production and by not using judges to 
determine the adequacy of student reasoning or depth of understanding, standardized tests 
prevent us from assessing the cardinal virtues of importance to both educators and employers: 
craftsmanship, thoughtfulness, initiative, persistence and rigor in intellectual work. 

An education should provide all students with an intellectual voice, empathy and autonomy as 
a thinker. Traditional assessment points in the opposite direction. It never asks us to produce 
our own products. It never asks us to use our judgment or justify our conclusions. It never 
enables us to explain our seemingly "wrong" answers. It never asks us to respond to the 
arguments of those who disagree with us. Yet, aren't these the intellectual challenges and 
achievements we value? Whether or not one ever attends higher education, all schooling — 
hence, all assessment — should be infused with these core values. Only then will all students 
understand what our ultimate expectations really are — and why they matter. 

The impatient among us will argue that there is no viable alternative in the large-scale policy 
arena to efficient, indirect tests. Yet, basic questions about this urge to mandate superficial 
measurement go perpetually unasked. Just what is being measured, at a penny per student, so 
that a school is genuinely assessed and accountable for what is in its control? Most such tests 
are less sensitive to value-added achievement than to the demographics and aptitude of the 
population. Why, for example, isn't the "marketplace" (of employer hirings, upper-level 
educational admissions and vocational and civic success) a better mechanism for comparing 
schools and offering incentive for local change? With no national curriculum and an 
appropriately diverse set of programs and schools, what can possibly be worth comparing on 
a single test? 

There is a painful irony in the call for more mandated tests as the answer to our woes. The 
move toward a more centralized "planned" education via state and national policy is occurring 
just as the rest of the Western political world embraces the wiser view that renewal depends 
upon liberated, local enterprise. 14 Such mandates pose unseen threats to the most 
sought-after graduate and undergraduate programs in the world since higher education is 
about the freedom of students and faculty to do their own work, but to do it well — quality, 
without either imposed orthodoxy or uniformity. The thoughtless imposition of tests will thus 
do more than impoverish the will and capacity of school faculties to raise expectations and 
hold high(er) standards. It will undermine the very idea of the university as a place for 
promoting and rewarding sound, free thinking. 

I thus seek the ultimate in what is now called "outcome-based" education and "site-based 
decision-making" — ensure that all local testing, across the K-graduate school continuum, be 
composed of the worthiest tasks, leading toward our best exit-level challenges, adapted to all 
grades as necessary. "But few students go on to upper-level college and graduate work!" 
Yes, but how many students are prevented from doing so by K-12 syllabi and tests thoroughly 
at odds with thoughtful inquiry and effective production? Coherent and authentic assessment 
is an equity and empowerment issue just as much as it is an accountability issue. 

The well-trained upper-level student knows what far too few younger students ever get to 
know: experts disagree for good reasons; important questions are not so much answered as 
explored effectively; knowledge is not so much "imparted" as effectively induced, 
constructed, used, tested, extended, criticized or transformed into unfamiliar truths. All 
assessment in education should show that good learning is revealed not by what "right 
answers" one knows but by whether one can provide well-reasoned and appropriate results. 
Students must learn through assessment that all real-world "testing" of ideas occurs through 
interaction — a "dialogue" between people, ideas, specific situations and contextual 
constraints that compel the creative adaptation of the more general truths and skills learned in 
schooling. 

We then teach a lesson that is morally as well as intellectually empowering: the judge, too, is 
subservient to standards and criteria. Grades and scores must not represent the apparent or 
mysterious tastes of judges. Otherwise, we teach students to passively end up trying to figure 
out "what they want." A system built out of secure multiple-choice questions, with no 
recourse to dialogue with assessors, is inquisitional. It violates the spirit of liberal education 
and ensures that many students never make it to the end — except those who already trust 
adults. 

But it is not sufficient to design assessment practices that enable students to understand the 
standards by which they will be judged as adult thinkers. Our assessment practices must 
teach students that the tasks, criteria and standards they encounter in schools and colleges are 
the appropriate ones for all rational inquiry and a fruitful, intellectual life. This means 
undoing the dogmatic, anti-intellectual effect of standardized, multiple-choice testing: the 
acquired, cynical view that there is an orthodox set of propositional truths that everyone "just 
has to know," that all questions have fixed, clear, official and unquestionable answers, and 
that machine-scored questions are "objective" and human-judged questions are "subjective" — 
i.e., the scoring is too soft and unreliable to use as solid evidence. 

Students must learn from assessment the truth about human judgment: the grounds of sound 
assessment are indeed objective — even if and when judges disagree — because the criteria 
are neither arbitrary nor ineffable. We have succeeded in educating students, in fact, when 



APPENDIX A 

Principles of Assessment for Better Learning 

Assessment Blue-printing and Task Design 

The following pages contain design suggestions for direct assessment by performance, 
product, project, exhibition or portfolio. These "templates" suggest the types of 
situations, simulations, roles and problems that can be used to make tasks authentic 
and engaging for any subject matter or age group. 

Common to all the ideas are three essential principles: 

(1) "Higher-order" thinking and acting requires that students produce "unique 
products or performances" (in the words of Bloom). 

(2) Task ideas can come from the modification of existing high-quality 
instructional activities — including such non-scholastic activities as Scouting, 
Odyssey of the Mind, vocational simulations and competitions, etc. 

(3) Assessment tasks should reveal the types of challenges actually encountered in 
the field when professionals are called upon to use knowledge effectively, 
imaginatively and in context — i.e., the products and performances are 
sensitive to audience, purpose, particular constraints of the setting, cost/benefit 
considerations, etc. 

These ideas are meant to be more than interesting, optional provocations. The 
assumption is that districts and schools would develop blue-printing policies for how 
all assessments should be constructed to ensure that they are maximally authentic, 
"higher-order" and articulated with system standards and performance targets. 

"Higher-order performance verbs" for use in assessment blue-printing 

Discern a Pattern 

Grasp Purpose & Reach Audience 

Empathize with the Odd 

Pursue Alternative Answers 

Achieve an Intended Aesthetic Effect 

Exhibit Findings Effectively 

Polish a Performance 

Lead a Group to Closure 

Develop and Effectively Implement a Plan 

Design, Execute and De-bug an Experiment 

Make a Novice Understand What You Deeply Know 

Induce a Theorem or Principle 

Explore and Report Fairly on a Controversy 

Lay Out "Cost-Benefit" Options 

Assess the Quality of a Product 

Graphically Display and Effectively Illuminate Complex Ideas 

Rate Proposals or Candidates 

Establish Principles 

Make the Familiar Strange 

Infer a Relationship 

Facilitate a Process and Result 

Create an Insightful Model 

Disprove a Common Notion 

Reveal the Limits of an Important Theory 

Successfully Mediate a Dispute 

Thoroughly Rethink an Issue 

Shift Perspective 

Imaginatively and Persuasively Simulate a Condition or Event 

Thoughtfully Evaluate and Accurately Analyze a Performance 

Judge the Adequacy of an Appealing Idea 

Accurately Self-Assess and Self-Correct 

Communicate in an Appropriate Variety of Media or Languages 

Complete a Cost-Benefit Analysis 

Question the Obvious or Familiar 

Analyze Common Elements of Diverse Products 

Test for Accuracy 

Negotiate a Dilemma 

Make the Strange Familiar 



b. Authentic assessment: 

4. Focuses on the students' ability to produce a quality product and/or performance. 

5. Involves de-mystified and non-secret tasks, criteria and standards; allows for thorough 
preparation and accurate self-assessment by the student. 

6. Relies on trained assessor judgment in reference to clear and appropriate criteria (as 
opposed to those most easily observed or scored). 

7. Involves interaction between assessor and student. Focuses on the 
student's ability to justify answers and respond to follow-up or probing questions. 

8. Involves patterns of response and behavior, consistency of performance: emphasis is 
on consistency of quality, habits of mind. 

9. Calls upon different forms of communicating and means of displaying mastery in 
an integrative "performance" or set of products, e.g., an oral [...] 



APPENDIX B: 

LONGITUDINAL SCORING SCALES FROM AUSTRALIA 
AND GREAT BRITAIN 

1. Proposed Scoring System for K-10 Assessment (U.K.): Writing 
Level 

Pupils should be able to: 



Description 



Use pictures, symbols or isolated letters, words or phrases to communicate 
meaning. 



^Z^^^ piCC f' ° f ***** usin « ^mplete sentences, some of 
toem demarcated with capital letters, periods or question marks. 

a^cc^ SCqUenCeS ° f ^ ° r ima ^ ned events coherently in chronological 

Write stories showing an understanding of the rudiments of story structure by 
establishing an opening, characters and one or more events 
Produce simple, coherent non-chronological writing. 



Produce, independently, pieces of writing using complete sentences, mainly
demarcated with capitals, periods and question marks.

Shape chronological writing by beginning to use a wider range of sentence 
connectives than "and" and "then." 

Write more complex stories with detail beyond simple events and with a [...]

Begin to revise and re-draft in consultation with the teacher or other children in 
the class, paying attention to meaning and clarity as well as checking for matters
such as correct use of tenses and pronouns.






Produce pieces of writing in which there is a rudimentary attempt to present
subject matter in a structured way (e.g., title, paragraphs, verses) [...]

Write stories which have an opening, a setting, characters, a series of events [...]
Organize non-chronological writings in orderly ways

[...] different [...] characteristic [...]

Attempt independent revising of their own writing and talk about the changes 
Write in a variety of forms, e.g., notes, letters, instructions, [...] poems, for a
range of purposes, e.g., to plan, inform, explain, entertain, express attitudes or [...]

Produce pieces of writing in which there is a more successful attempt to present
subject matter in a structured way, e.g., by layout, headings;
in which sentence punctuation is almost always accurately used;
and in which simple uses of the comma are handled [...]
Write in standard English (except in contexts where non-standard forms are
appropriate) and show an increasing differentiation between speech and writing,
e.g., by using constructions which decrease repetition.
Assemble ideas on paper and show some ability to produce a draft from them 
and to redraft or revise as necessary. 



Write in a variety of forms for a range of purposes showing some ability to
present subject matter differently for different specified audiences.
Make use of literary stylistic features, such as alteration of word order for
emphasis or the deliberate repetition of words or [...]
Show some ability to recognize when planning, drafting, redrafting and revising
are appropriate and to carry these processes out.
Produce well-structured pieces of writing, some of which handle more
demanding subject-matter, e.g., going beyond first-hand experience.
Make a more assured and selective use of a wider range of grammatical and
lexical features appropriate for topic and audience.

Show an increased awareness that a first draft is malleable, e.g., by changing
the form in which writing is cast (from story to play), or by altering sentence
structure and placement.



10 • Write, at appropriate length, in a wide variety of forms, with assured sense of
[...]

• Organize complex subject matter clearly and effectively. Produce well-structured
pieces in which relationships between successive paragraphs are [...]

• Make an assured, selective and appropriate use of a wide range of grammatical
constructions and of an extensive vocabulary. Sustain the chosen style
consistently. Achieve felicitous or striking effects, showing evidence of a
personal style.
From the Victoria, Australia, Literacy Profiles: Reading

[...] using [...] "bands" for grades K-10.
Teachers keep running records, based on observation and tasks assigned.

Reading Band A 



Holds book the right way up
On request, indicates the beginning and end of sentences
Refers to letters by name
Responds to literature (smiles, claps, listens intently)

Reading Band B 

Takes risks when reading 
Asks others for help with meaning and 
pronunciation 



Turns pages front to back 
Locates words, lines, spaces, letters
Identifies known, familiar words 
Joins in familiar stories 
Shows preference for particular books 



Uses pictures for clues to meaning of text 
Predicts words 

Makes a second attempt at a word if it doesn't sound right
Recognizes root words within other words
[...]



Reading Band D 

Selects books to fulfill own purposes
Substitutes words with similar meanings when reading aloud
Themes from reading appear in art work
Reads materials with a wide variety of styles and topics



Reading Band F 



States main idea in a passage
Self-corrects, using knowledge of language structure or sound-symbol to make sense of a word or phrase
Uses vocabulary and sentence structure from reading in written work as well as talk



Selects relevant passages to answer questions
Maps out plots and character developments in novels
Makes connections between texts
Discusses styles used by different authors
Offers reasons for response provoked by text
Justifies own appraisal of text



Formulates questions and finds relevant information from reading
Varies reading strategies according to purposes of reading and nature of text
Discusses author's intent
Forms generalizations about a range of genres including myth, short story
Offers critical opinion or analysis of reading passages in discussion
Reading Band I

[...] Interprets analogy, allegory and [...]
[...]
Analyzes the cohesiveness of text
Discusses and writes about the author's bias
APPENDIX C
VICTORIA, AUSTRALIA "WORK REQUIREMENTS" 

Work Requirements — Literature 
Keeping a Reading Journal 

• Personal responses to texts 

• Notes from group and class discussions 

• Short responses to textual issues and questions 

• List of texts read, with comments 

Developing a Portfolio 

• Three "finished responses" included for each of two units; one must be
"discursive-analytical," another "creative"

• At least one response in oral form 

Presentation of a Review of Reading 

• Based on student's independent reading 

• Presentation to class, as well as written; mindful of audience 

Writing a Text 

• Finished piece should be read by an audience other than the teacher 

Exploring Fiction in Television, Film or Radio
Producing an Extended Response 
Interpreting a Text for Performance 
Comparing Readings 
Investigating Contexts 
Analyzing a Review 
Presenting a Written Analysis 
Work Requirements - Chemistry (built around four units: materials, chemistry in everyday
life, chemistry and the marketplace, and energy and matter)

Modeling Structures

• Construct models to represent continuous lattices, discrete molecules
• Inspect and evaluate models of nuclear atom, polymers, ceramics, alloys
• Discuss strengths and weaknesses of models

Investigation of Waste Materials 

• [...] a waste sample, e.g., contaminated [...]

• Identify waste materials generated during production of a useful material, strategies
used for dealing with waste

• Discuss advantages and disadvantages of methods used in waste treatment and their
implications for continued use of material

Investigation of Oxidation-Reduction Reactions

• Perform a range of experiments to observe oxidation reactions, demonstrate electron
transfer nature by constructing simple galvanic cells

• Design and perform an experiment which relates to metal reactivity and corrosion
protection, evaluate the experimental design

Other Work Requirements: 

Media File                           Investigation of a Chemical of Local Importance
Record of Reactions                  Product Analysis
Investigation of an Instrument       Investigation of [...]
Changing Models of the Atom          Investigation of Periodic Table
Food [...]                           Annotated Flow Charts; Concept Map
Investigation of Useful Materials    Investigation of Properties of Water and Atmosphere



Laboratory activities should occupy at least 25% of each of the 4 units. Students should
record details of lab activities in a log book. Such records should be used to prepare
reports; in each unit students should prepare two full reports.
ENDNOTES



1. This is a revised version of a paper presented for discussion at a conference on
assessment policy across the educational system, sponsored and hosted by the Education
Commission of the States in June 1991 in Breckenridge, Colorado.

2. The New Standards Project, headed by Lauren and Daniel Resnick
and Marc Tucker, would use such a moderation procedure for developing a variety of
[...] new national examination system for grades [...] and
[...] of comparable longitudinal data bases, established [...]
by [...] education institutions [...] and analyzed.

3. As translated by Naomi Lewis (1981). 

4. Consider, by contrast, the recent British manifesto underlying their new national
[...] classroom-based, teacher-overseen
assessment). The national system should employ tests [of] a wide range of modes of
presentation, [...] response....a mixture of tests, practical tasks and
[...] in order to minimize curricular distortion....The group has
no doubt that a successful system depends upon teachers' confidence in it; the report
recommends that teachers be given more support in assessment, by
providing them with a wider range of diagnostic tests....the tests should be so
[...] seem to students [...] classroom [...]

5. [...] countries [...] refer to our fetish for increasingly relying on
multiple-choice tests as "the American solution" to educational problems.

6. See the appendix for some sample policy statements designed to ensure that the
purpose of assessment is honored by policy and practice.



7. See Ewen in Grant, Elbow et al. (1979) on the competencies of the liberal
arts.

8. [...] (1991) for a detailed account of "value-added" assessment of higher
education.



9. For an excellent discussion of authentic assessment and how to maximize the
[...] on schooling ("systemic" validity), see [...]

[...] Messick and others in the psychometric community have called
"consequential validity" (an idea so obvious that one is stunned to realize that few
within this community saw the need until recently to factor the effect of testing into
test design) is usually applied to test and teaching practice but too rarely [...]
views of what a curriculum is. A critical need is to rethink what [...]
[...] so that we don't think of [...] instructions [...]
whether what was "covered" was "learned" ([...] recalled or used in a low-level way).

10. As reported in Peters (1989), spoken by a Dow Chemical quality control engineer. Cf.
Wiggins (1991).

11. We know the bad news: Harvard, Stanford, Howard, Earlham and others [...]
rejection." The 8:1 ratio of application to acceptance or worse is irrelevant,
however, to the moral obligation of all lower-schooling [...] to equip students to
be maximally prepared, should the "right fat letter" come on April 16th.

12. See Elbow (1986) on this point; cf. Astin (1991) for his distinction between tests as
knaves and tests as providing feedback, where the same point is made.

13. While such tests have long been criticized as [...]
epistemological implications are more troublesome. It is a mistaken [...]
that "knowledge" consists of facts and unequivocal right answers, as opposed to
[...] claims and arguments. [...] knowledge and
understanding embedded in traditional tests is at odds with [...]
sciences; it harkens back to a medieval view of knowledge. See "The Futility of
Teaching Everything of Importance" (Wiggins 1989a).

14. This is the core of the argument made in the widely [...]
(1990). Most of the press attention focuses on their call for choice but that call is a
[...]. The problem, as they see it, is that American education [...]
as long as mandates from external governing bodies, not
market forces working on autonomous schools, determine policy.